{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "09814d84",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/metadata_extraction/PydanticExtractor.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e9f1d0d-1c85-4760-be5d-665fe98da389",
   "metadata": {},
   "source": [
    "# Pydantic Extractor\n",
    "\n",
    "This notebook demonstrates `PydanticProgramExtractor`, which uses an LLM (either a standard text-completion LLM or a function-calling LLM) to extract an entire Pydantic object from each text chunk.\n",
    "\n",
    "The advantage over using several single-purpose metadata extractors is that we can extract multiple metadata fields with a single LLM call."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3fb90eb-96e5-4c33-9fa4-67f7bbb5770d",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6efa88c",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-readers-web\n",
    "%pip install llama-index-program-openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d8a3548-c108-44df-9e92-6a7c6fc803ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import nest_asyncio\n",
    "import openai\n",
    "\n",
    "# Allow nested event loops so async calls can run inside the notebook.\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e441d755-481f-452d-bbc3-eb14475f2139",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"OPENAI_API_KEY\"] = \"YOUR_API_KEY\"\n",
    "openai.api_key = os.getenv(\"OPENAI_API_KEY\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c527b56-f013-48d9-b587-7aec85485714",
   "metadata": {},
   "source": [
    "### Set Up the Pydantic Model\n",
    "\n",
    "Here we define a basic structured schema that we want to extract. It contains:\n",
    "\n",
    "- `entities`: unique entities in a text chunk\n",
    "- `summary`: a concise summary of the text chunk\n",
    "- `contains_number`: whether the text chunk contains any numbers\n",
    "\n",
    "This is a toy schema. We encourage you to be creative about the kinds of metadata you want to extract!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4fac11dd-3dc0-4d4b-b3ef-8854fae4bad7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pydantic import BaseModel, Field\n",
    "from typing import List"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7136d0b-cbf7-493b-9ab7-d8c16478eea8",
   "metadata": {},
   "outputs": [],
   "source": [
    "class NodeMetadata(BaseModel):\n",
    "    \"\"\"Node metadata.\"\"\"\n",
    "\n",
    "    entities: List[str] = Field(\n",
    "        ..., description=\"Unique entities in this text chunk.\"\n",
    "    )\n",
    "    summary: str = Field(\n",
    "        ..., description=\"A concise summary of this text chunk.\"\n",
    "    )\n",
    "    contains_number: bool = Field(\n",
    "        ...,\n",
    "        description=(\n",
    "            \"Whether the text chunk contains any numbers (ints, floats, etc.)\"\n",
    "        ),\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c3b8f4c-56d1-4703-a8d9-2ea4fa162b9e",
   "metadata": {},
   "source": [
    "### Set Up the Extractor\n",
    "\n",
    "Here we set up the metadata extractor. The prompt template is included for visibility into what the extractor asks the LLM; note that the code below does not pass it in explicitly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cca1d254-d4b3-4ae4-a0a4-2444d847b6b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.program.openai import OpenAIPydanticProgram\n",
    "from llama_index.core.extractors import PydanticProgramExtractor\n",
    "\n",
    "# Extraction prompt, shown for visibility only: it is not passed to the\n",
    "# program or the extractor below.\n",
    "EXTRACT_TEMPLATE_STR = \"\"\"\\\n",
    "Here is the content of the section:\n",
    "----------------\n",
    "{context_str}\n",
    "----------------\n",
    "Given the contextual information, extract out a {class_name} object.\\\n",
    "\"\"\"\n",
    "\n",
    "openai_program = OpenAIPydanticProgram.from_defaults(\n",
    "    output_cls=NodeMetadata,\n",
    "    prompt_template_str=\"{input}\",\n",
    "    # extract_template_str=EXTRACT_TEMPLATE_STR\n",
    ")\n",
    "\n",
    "# `input_key` names the `{input}` variable in `prompt_template_str` above.\n",
    "program_extractor = PydanticProgramExtractor(\n",
    "    program=openai_program, input_key=\"input\", show_progress=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6cd2564-8beb-478f-873c-00a369e60097",
   "metadata": {},
   "source": [
    "### Load the Data\n",
    "\n",
    "We load Eugene Yan's essay on LLM patterns (https://eugeneyan.com/writing/llm-patterns/) using the `SimpleWebPageReader` from LlamaHub."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30631ac3-fd93-42eb-a7e4-258d46999ebd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the blog post and convert its HTML to plain text.\n",
    "\n",
    "from llama_index.readers.web import SimpleWebPageReader\n",
    "from llama_index.core.node_parser import SentenceSplitter\n",
    "\n",
    "reader = SimpleWebPageReader(html_to_text=True)\n",
    "docs = reader.load_data(urls=[\"https://eugeneyan.com/writing/llm-patterns/\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be14ef46-04b5-4cea-b240-ced80641f869",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.ingestion import IngestionPipeline\n",
    "\n",
    "node_parser = SentenceSplitter(chunk_size=1024)\n",
    "\n",
    "pipeline = IngestionPipeline(transformations=[node_parser, program_extractor])\n",
    "\n",
    "orig_nodes = pipeline.run(documents=docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2769efd-b34a-4df4-85ef-f2de84ce061b",
   "metadata": {},
   "outputs": [],
   "source": [
    "orig_nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad8ebec7-6f8c-464b-b075-997be0c814a7",
   "metadata": {},
   "source": [
    "## Extract Metadata\n",
    "\n",
    "Now that we've set up the metadata extractor and loaded the data, we're ready to extract some metadata!\n",
    "\n",
    "The Pydantic program extractor extracts *all* of the metadata fields for a given chunk in a single LLM call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "951979b2-418d-4ad6-9c43-f50580c9944d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/json": {
       "ascii": false,
       "bar_format": null,
       "colour": null,
       "elapsed": 0.005166053771972656,
       "initial": 0,
       "n": 0,
       "ncols": null,
       "nrows": 37,
       "postfix": null,
       "prefix": "Extracting Pydantic object",
       "rate": null,
       "total": 1,
       "unit": "it",
       "unit_divisor": 1000,
       "unit_scale": false
      },
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5b7dab3b40694d508549e25f20fe5048",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Extracting Pydantic object:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sample_entry = program_extractor.extract(orig_nodes[0:1])[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd23ab40-8d41-44a7-ab15-037336ad2086",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'entities': ['eugeneyan', 'HackerNews', 'Karpathy'],\n",
       " 'summary': 'This section discusses practical patterns for integrating large language models (LLMs) into systems & products. It introduces seven key patterns and provides information on evaluations and benchmarks in the field of language modeling.',\n",
       " 'contains_number': True}"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(sample_entry)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2081326-1a6b-4f73-8a84-4850eea9b5df",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/json": {
       "ascii": false,
       "bar_format": null,
       "colour": null,
       "elapsed": 0.004650115966796875,
       "initial": 0,
       "n": 0,
       "ncols": null,
       "nrows": 37,
       "postfix": null,
       "prefix": "Extracting Pydantic object",
       "rate": null,
       "total": 29,
       "unit": "it",
       "unit_divisor": 1000,
       "unit_scale": false
      },
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9829cc571df447d5a0be6c349bc036e0",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Extracting Pydantic object:   0%|          | 0/29 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "new_nodes = program_extractor.process_nodes(orig_nodes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac2b984e-8da5-46cc-9afe-dc10044bc01a",
   "metadata": {},
   "outputs": [],
   "source": [
    "display(new_nodes[5:7])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_index_v2",
   "language": "python",
   "name": "llama_index_v2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
